Co-expression in tissue-specific gene networks links genes in cancer-susceptibility loci to known somatic driver genes

Background
The genetic background of cancer remains complex and challenging to integrate. Many somatic mutations within genes are known to cause and drive cancer, while genome-wide association studies (GWAS) of cancer have revealed many germline risk factors associated with cancer. However, the overlap between known somatic driver genes and positional candidate genes from GWAS loci is surprisingly small. We hypothesised that genes from multiple independent cancer GWAS loci should show tissue-specific co-regulation patterns that converge on cancer-specific driver genes.

Results
We studied recent well-powered GWAS of breast, prostate, colorectal and skin cancer by estimating co-expression between genes and subsequently prioritising genes that show significant co-expression with genes mapping within susceptibility loci from cancer GWAS. We observed that the prioritised genes were strongly enriched for cancer drivers defined by COSMIC, IntOGen and Dietlein et al. The enrichment of known cancer driver genes was most significant when using co-expression networks derived from non-cancer samples of the relevant tissue of origin.

Conclusion
We show how genes within risk loci identified by cancer GWAS can be linked to known cancer driver genes through tissue-specific co-expression networks. This provides an important explanation for why seemingly unrelated sets of genes that harbour either germline risk factors or somatic mutations can eventually cause the same type of disease.

Supplementary Information
The online version contains supplementary material available at 10.1186/s12920-024-01941-4.


Supplementary data
Note S1. Quality control and sample selection of recount3 data
Note S2. Tissue prediction and per tissue quality control of recount3 data
Fig. S1. Number of cancer-specific drivers and the number shared across tumour types
Fig. S2. Enrichment of cancer-specific driver genes for different PascalX prioritisations
Fig. S3. Enrichment of cancer-specific somatic and germline driver genes for different multi-tissue- and tissue-specific-based gene-prioritisation methods applied to cancer GWAS summary statistics
Fig. S4. Enrichment of cancer-specific driver genes for different tissue-specific matched networks
Fig. S5. Enrichment of cancer-specific somatic driver genes for all combinations of different multi-tissue and tissue-specific gene-prioritisation methods applied to cancer GWAS summary statistics using PoPS
Fig. S6. Enrichment of COSMIC tier 1 cancer-specific driver genes for different tissue-specific matched networks
Fig. S7. Enrichment of cancer drivers suspected to be tumour suppressors
Fig. S8. Enrichment of cancer drivers suspected to be oncogenes
Fig. S9. Enrichment of cancer drivers exclusive to their tissue of origin for the different prioritisation scores
Table S1. All GWAS of cancer traits considered in this study
Table S2. Somatic cancer driver genes
Table S3. PascalX scores
Table S4. Downstreamer gene prioritisation
Table S5. Gene linkage disequilibrium scores
Data S1. Downstreamer gene prioritisation (flat text)
Data S2. Pathway analysis

Note S1. Quality control and sample selection of recount3 data
Phase 1: We first performed a rough selection of samples using the following steps.
• Exclude all data from study SRP025982 (mixed tissues and spiked data for benchmarks).
Phase 2: We only retained genes that were expressed in at least 50% of the samples.
Phase 3: We performed another sample QC using only the maintained genes.
• Exclude samples with zero expression for >50% of the genes.
• Use singular value decomposition (SVD) on quantile-normalised expression to remove outliers on the first component.
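The gene and sample filters of Phases 2 and 3 can be illustrated with a minimal NumPy sketch. It is not the authors' pipeline: treating "expressed" as a value above zero, the function names, and the z-score cut-off used to call an outlier on the first component are all assumptions made for illustration, since the note does not state them.

```python
import numpy as np

def filter_genes(expr, min_frac=0.5):
    """Mask of genes (rows) with expression > 0 in at least min_frac of samples.
    Using > 0 as the definition of 'expressed' is an assumption."""
    return (expr > 0).mean(axis=1) >= min_frac

def filter_samples(expr, max_zero_frac=0.5):
    """Mask of samples (columns) with zero expression in at most max_zero_frac of genes."""
    return (expr == 0).mean(axis=0) <= max_zero_frac

def quantile_normalise(expr):
    """Give every sample the same distribution: the mean of the sorted columns.
    Ties are broken arbitrarily by argsort in this simple version."""
    ranks = np.argsort(np.argsort(expr, axis=0), axis=0)
    mean_sorted = np.sort(expr, axis=0).mean(axis=1)
    return mean_sorted[ranks]

def first_component_outliers(expr, z_cutoff=3.0):
    """Flag samples with an extreme loading on the first SVD component.
    Expects quantile-normalised input; z_cutoff is a hypothetical threshold."""
    centred = expr - expr.mean(axis=1, keepdims=True)  # centre each gene
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    pc1 = vt[0]                                        # one value per sample
    z = (pc1 - pc1.mean()) / pc1.std()
    return np.abs(z) > z_cutoff
```

In use, the gene mask would be applied first, the sample mask would be computed on the retained genes, and the outlier flags would be computed on the quantile-normalised remainder.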
Phase 4: We corrected the remaining samples for covariates using the following steps.
• 675 SRA samples were excluded for missing covariate data. Thus, the total number of samples included was 142,849.
Phase 5: We predicted cell lines and cancer samples. The predictions were based on the sample principal components and trained using the annotations known for a subset of the samples. For the prediction of primary tissues vs cell lines, we used logistic regression using the principal components.
For the prediction of cancer samples, we used the method developed by Fehrmann et al. [1]. This first determines the auto-correlation per component, which is higher for components that reflect copy number alterations. The sample loadings are then used to create a score per sample that indicates the amount of copy number alterations in the samples. We could then use this score in a second logistic regression model that discriminated between primary tissues and cancer samples.
Neither of these models yielded perfect separation between the three classes of samples. While this is in part driven by erroneous annotations in the public repositories, further QC could improve the creation of the tissue-specific subsets.
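The idea behind the copy-number score used in Phase 5 can be sketched as follows. This is a simplified illustration and not the published implementation of Fehrmann et al.: it assumes that the auto-correlation is taken at lag 1 over gene loadings ordered by genomic position, and that the per-sample score sums absolute sample loadings over components above a hypothetical auto-correlation threshold (`min_autocorr`).

```python
def lag1_autocorrelation(values):
    """Lag-1 auto-correlation of a sequence, here the gene loadings of one
    component ordered by genomic position (the ordering is an assumption)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values)
    if var == 0:
        return 0.0
    cov = sum((values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1))
    return cov / var

def cna_score(sample_loadings, autocorrelations, min_autocorr=0.5):
    """Hypothetical per-sample score: sum of absolute loadings on components
    whose gene loadings are strongly auto-correlated along the genome."""
    return sum(
        abs(loading)
        for loading, autocorr in zip(sample_loadings, autocorrelations)
        if autocorr >= min_autocorr
    )

# A component with long runs of equal sign (as expected for a gained or lost
# chromosomal segment) scores high; an alternating pattern scores low.
blocky = [1, 1, 1, 1, -1, -1, -1, -1]
alternating = [1, -1, 1, -1, 1, -1, 1, -1]
autocorrs = [lag1_autocorrelation(blocky), lag1_autocorrelation(alternating)]
score = cna_score([2.0, -3.0], autocorrs)
```

A score of this kind would then serve as the predictor in the second logistic regression model.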

Note S2. Tissue prediction and per tissue quality control of recount3 data
To predict tissues for the samples predicted to be neither cell lines nor cancerous, we started anew from transcripts per million (TPM) values. We selected the genes expressed in at least 50% of the samples, performed log2 and quantile normalisation and corrected for the same covariates as before. We then performed a new principal component analysis (PCA) and used the components in a multinomial logistic regression model trained on the known sample annotations.
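The preprocessing chain described above (log2 transformation, covariate correction, PCA) might look roughly like this in NumPy. The pseudocount of 1, the use of ordinary least-squares residuals for covariate correction, and the function names are assumptions; the note does not specify how the correction was implemented, and quantile normalisation is omitted here for brevity.

```python
import numpy as np

def log2_transform(tpm, pseudocount=1.0):
    """log2(TPM + pseudocount); the pseudocount value is an assumption."""
    return np.log2(tpm + pseudocount)

def regress_out(expr, covariates):
    """Remove linear covariate effects per gene via least-squares residuals.
    expr: genes x samples; covariates: samples x k."""
    design = np.column_stack([np.ones(covariates.shape[0]), covariates])
    coef, *_ = np.linalg.lstsq(design, expr.T, rcond=None)
    return (expr.T - design @ coef).T

def sample_pcs(expr, n_components):
    """Principal component scores per sample (samples x n_components) via SVD."""
    centred = expr - expr.mean(axis=1, keepdims=True)
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    return (vt[:n_components] * s[:n_components, None]).T
```

The resulting component scores are the kind of features that would be passed to the tissue classifier.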
One major confounder of tissue type is the study of origin. Typically, samples from the same study are sequenced using the same type of sequencer and read length, and most studies investigate a single tissue. But there are many differences among the different studies. We can correct for these to some extent by including technical differences as confounders, but we found that this adversely affected our prediction accuracy. We therefore devised the following strategy to create a representative training set. Ideally, we would only use a single sample per tissue from each study to train the prediction model. In practice, for some tissues, this would result in a rather limited number of usable samples. To overcome this, we increased the number of samples per tissue per study to ensure at least 50 training samples per tissue. Based on early tests, we noticed that we could not reliably discriminate between adipose and breast samples. These samples were therefore combined in a single adipose-breast network that we refer to as a 'breast' network in this manuscript for clarity.
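One way to read the training-set strategy above is sketched below: start with one sample per study for each tissue and raise the per-study allowance only until the tissue reaches 50 training samples. The input layout, the function name and the random choice within a study are assumptions for illustration; the note does not describe how individual samples were picked.

```python
import random
from collections import defaultdict

def select_training_samples(samples, min_per_tissue=50, seed=0):
    """samples: iterable of (sample_id, tissue, study) tuples.
    Returns a dict mapping each tissue to its selected sample ids."""
    rng = random.Random(seed)
    by_tissue = defaultdict(lambda: defaultdict(list))
    for sample_id, tissue, study in samples:
        by_tissue[tissue][study].append(sample_id)

    selected = {}
    for tissue, studies in by_tissue.items():
        for ids in studies.values():
            rng.shuffle(ids)  # random pick within a study (assumption)
        largest = max(len(ids) for ids in studies.values())
        per_study = 1
        chosen = [i for ids in studies.values() for i in ids[:per_study]]
        # Allow more samples per study only while the tissue is short of the target.
        while len(chosen) < min_per_tissue and per_study < largest:
            per_study += 1
            chosen = [i for ids in studies.values() for i in ids[:per_study]]
        selected[tissue] = chosen
    return selected
```

With this scheme, a tissue covered by many studies keeps one sample per study, whereas a tissue covered by few studies draws several samples from each.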
We then used the R package glmnet [2] to do lasso regression with cross validation to select an optimal lambda. This model was then applied to all samples, and we assigned each sample the tissue with the highest posterior probability. Samples for which the highest posterior probability was less than 0.5 were excluded.
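The assignment rule can be written down compactly. The sketch below assumes the posterior probabilities have already been produced by the fitted model (it does not reproduce the glmnet fit), and the dictionary layout and function name are illustrative.

```python
def assign_tissues(posteriors, tissues, min_prob=0.5):
    """posteriors: dict mapping sample id to a list of class probabilities
    aligned with `tissues`. Samples whose highest posterior is below
    min_prob are left out of the result."""
    assigned = {}
    for sample_id, probs in posteriors.items():
        best = max(range(len(tissues)), key=lambda i: probs[i])
        if probs[best] >= min_prob:
            assigned[sample_id] = tissues[best]
    return assigned

tissues = ["blood", "breast", "colon"]
posteriors = {
    "a": [0.70, 0.20, 0.10],  # confident call
    "b": [0.40, 0.35, 0.25],  # highest posterior below 0.5, so excluded
    "c": [0.10, 0.50, 0.40],  # exactly 0.5 is retained
}
calls = assign_tissues(posteriors, tissues)
```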
As a final QC, we performed a PCA per tissue and excluded outliers. This resulted in 46,410 samples. Per tissue, we eventually used VST [3] for the normalisation and corrected the data for the covariates. An SVD was used to extract the eigenvectors with gene loadings that are used by Downstreamer for the gene prioritisation.
For the recount3 multi-tissue network, we used quantile normalisation and covariate correction for the 46,410 samples for which we have a predicted tissue assignment. Here we used SVD to obtain the eigenvectors.
Fig. S10. Correlations of Downstreamer Z-scores of all combinations of different multi-tissue and tissue-specific networks
Fig. S11. Downstreamer gene-prioritisation scores from multi-tissue networks versus loss-of-function Z-scores indicating gene constraint
Fig. S12. Downstreamer gene-prioritisation scores from tissue-specific networks versus loss-of-function Z-scores indicating gene constraint

Fig. S1. Number of cancer-specific drivers and the number shared across tumour types